Climate Study in Switzerland

Monthly averages of temperature and precipitation from 14 weather stations in Switzerland are available here. The records reach back to 1867. Let's see what we can learn about

  • seasons
  • climate differences across Switzerland
  • maybe even climate change.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy as pt
from statsmodels.graphics.gofplots import qqplot
from datetime import datetime

In [2]:
import zipfile
from urllib.request import urlopen

# Get remote data:
url = 'http://data.geo.admin.ch/ch.meteoschweiz.homogenereihen/VQAA60.csv'

# Read into DataFrame:
data = pd.read_csv( urlopen(url), sep='|',
            skiprows=[0,1,2,3],
            na_values=['-'],
            names=['Station', 'Time', 'H', 'T'],
            parse_dates=['Time'], 
            date_parser=lambda d: pd.to_datetime(d, format='%Y%m') )

# statsmodels does not work so well with nans. Get rid of them:
data.dropna( inplace=True )

Data Inspection

First, let's have a look at the data we downloaded:


In [3]:
data[:6]


Out[3]:
Station Time H T
0 BAS 1867-01-01 140.9 -0.2
1 BAS 1867-02-01 66.9 5.7
2 BAS 1867-03-01 70.8 4.9
3 BAS 1867-04-01 92.5 10.0
4 BAS 1867-05-01 73.9 13.6
5 BAS 1867-06-01 140.6 16.1

OK, we have an integer index, station names (abbreviated), and the time of measurement. The actual data columns are H (rainfall in millimeters) and T (temperature in $^{o}$C).
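
Before moving on, a quick sanity check of the parsed data doesn't hurt. The snippet below is only a sketch (it assumes the DataFrame data created above): it lists the stations, the time span, and summary statistics of the two measurement columns.

# Sanity check of the parsed data (sketch):
print('Stations:', sorted(data['Station'].unique()))
print('Number of stations:', data['Station'].nunique())
print('Time range:', data['Time'].min(), 'to', data['Time'].max())
print(data[['H', 'T']].describe())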

Let's create a hierarchical index using Station and Time. Furthermore, I want to know the month (for fitting the seasonal cycle later on), the year, and the time elapsed since the start of the measurements:


In [4]:
data['Month'] = pd.DatetimeIndex(data['Time']).month
data['Year'] = pd.DatetimeIndex(data['Time']).year
dt = data['Time'] - data['Time'].min()
# Elapsed time in years ('Y' is not a valid Timedelta unit in recent pandas):
data['dt'] = dt / pd.Timedelta(days=365.25)

# Drop station 'PAY' as all data there is NaN:
data = data[data['Station'] != 'PAY']

# Create 2-level index (Station, Time):
data = data.set_index(pd.MultiIndex.from_arrays( 
                                 [data['Station'], data['Time']]),
                                 drop=False)

data[:6][['H', 'T', 'Month', 'Time']]


Out[4]:
H T Month Time
Station Time
BAS 1867-01-01 140.9 -0.2 1 1867-01-01
1867-02-01 66.9 5.7 2 1867-02-01
1867-03-01 70.8 4.9 3 1867-03-01
1867-04-01 92.5 10.0 4 1867-04-01
1867-05-01 73.9 13.6 5 1867-05-01
1867-06-01 140.6 16.1 6 1867-06-01

The data for 'BAS' (probably Basel) is shown below. As one would expect, the temperature is cyclic: summer is warmer than winter. The precipitation looks mostly like noise. Taking means for each month (see the table below), however, hints that there is systematically more precipitation in summer than in winter.


In [5]:
fig, ax1 = pl.subplots()

pl.title('Temperature in BAS')
ax1.set_xlabel('Date')
ax1.set_ylabel('Temperature ($^{o}$ C)', color='r')
# Use the time stamps of the 'BAS' rows themselves (the full index level
# also contains times at which only other stations measured):
ax1.plot_date(data.loc['BAS'].index.astype(datetime), data.loc['BAS']['T'],
              'ro-', label='T')

ax2 = ax1.twinx()
ax2.set_ylabel('Rainfall (mm)', verticalalignment='bottom',
               rotation=-90, color='b')
ax2.plot_date(data.loc['BAS'].index.astype(datetime), data.loc['BAS']['H'],
              'bo-', label='H')

ax1.set_xlim([datetime(1867, 1, 1), datetime(1877, 1, 1)])

data.groupby('Month')[['H', 'T']].aggregate(['mean', 'std'])


Out[5]:
H T
mean std mean std
Month
1 83.541880 78.870473 -2.791613 3.985315
2 76.770699 75.877096 -1.747151 4.510225
3 86.466703 75.698482 1.391286 4.805732
4 97.498870 75.366269 4.980796 5.069517
5 113.130931 72.567541 9.309736 5.001985
6 128.720710 73.514679 12.681926 4.971962
7 130.652366 79.420809 14.728925 4.937872
8 134.391237 79.178608 14.160968 4.635340
9 110.250134 76.685653 11.091134 4.348568
10 106.498763 86.235033 6.497473 3.957268
11 101.200269 89.541348 1.626667 3.797036
12 95.625767 86.927008 -1.742711 3.868841

Taking means by station separates the warmer from the colder and the wetter from the drier locations:


In [6]:
data.groupby('Station')[['H', 'T']].aggregate(['mean', 'std'])


Out[6]:
H T
mean std mean std
Station
BAS 66.149469 39.326102 9.458021 6.550368
BER 83.489324 48.481764 7.869424 6.855973
CHD 109.260498 58.977209 5.366891 6.493301
CHM 101.421968 57.145141 5.460928 6.113159
DAV 81.542727 51.780745 2.595568 6.438817
ENG 129.585020 66.342045 5.530520 6.447743
GSB 183.151774 101.087606 -1.264591 5.872440
GVE 77.474455 48.267088 9.595975 6.741932
LUG 132.004192 100.705848 11.701174 6.687905
SAE 218.056590 120.579510 -2.033042 5.261947
SIA 81.794966 58.360747 1.611521 6.677687
SIO 47.418893 35.199694 9.074399 7.321170
SMA 88.480324 51.246788 8.428675 6.779093

Seasons and Geographical Differences

The temperature is seasonal and depends on the station, so let's fit these dependencies.
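
In formula notation, the model fitted below, T ~ C(Month):Station - 1, effectively assigns one mean temperature to every (month, station) combination:

$$ T_{s,m,t} = \mu_{s,m} + \varepsilon_{s,m,t}, $$

where $s$ indexes the station, $m$ the month, and $\varepsilon_{s,m,t}$ is the residual noise.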


In [7]:
# Helpers
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

def _extractLevels(categoricalVars):
    out = [name.split('[')[1][:-1] for name in categoricalVars]
    return(out)

def extractLevels(categoricalVarNames):
    """
    Extract the level (or levels, in the case of interaction terms) from a
    list of column names of a design matrix created by patsy.
    
    Example:
    
    input:  [ 'C(Month)[2]:Station[BER]', 'C(Month)[3]:Station[BER]' ]
    output: [ ['2', 'BER'], ['3', 'BER'] ]
    """
    
    out = [name.split(':') for name in categoricalVarNames]
    out = [_extractLevels(name) for name in out]
    return(out)
    

# Season:Station
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

data.dropna(inplace=True)

y, X = pt.dmatrices('T ~ C(Month):Station - 1', data=data)
model = sm.OLS(y, X)
fit = model.fit()
# print(fit.summary())

interactionSlice = X.design_info.term_name_slices['C(Month):Station']
interactionTemp = fit.params[interactionSlice]
interactionNames = X.design_info.column_names[interactionSlice]
interactionNames = extractLevels(interactionNames)
interactionNames = list(zip(*interactionNames))

data['interaction'] = data['Month'].astype('str') + ':' + data['Station']

pl.figure()
pl.title('Fitted Mean Temperature per Month and Station')
pl.xlabel('Month (Nr.)')
pl.ylabel('Temperature ($^{o}$ C)')

# One row per station, one column per month
# (patsy varies the first factor, Month, fastest):
effects = fit.params.reshape(-1, 12)
pl.plot(effects.T, 'o-')

pl.legend(interactionNames[1][::12]);
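
As a small follow-up sketch (assuming the cell-means reading of the fit above, i.e. effects holds one fitted monthly mean per station), the seasonal amplitude per station - the difference between the warmest and the coldest fitted month - can be compared directly:

# Seasonal amplitude per station: warmest minus coldest fitted monthly mean.
stations = list(interactionNames[1][::12])
amplitude = effects.max(axis=1) - effects.min(axis=1)
print(pd.Series(amplitude, index=stations).sort_values())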


Having fitted the obvious dependencies, we can now - hopefully - get a clearer view of the non-obvious effects. But first, let's check the validity of the regression (see the plots below):

  • The Q-Q plot indicates a tail towards negative residuals
  • Residuals vs Month and vs Station show nothing suspicious
  • Residuals vs time indicate an unmodeled dependency (the temperature seems to increase over the century)
  • Residuals vs fitted values show a somewhat larger variance at temperatures below ~2$^o$C

In [8]:
data['fit'] = fit.predict(exog=X)
data['resid'] = data['T'] - data['fit']

fig, ax = pl.subplots()
qqplot(data['resid'], ax=ax)

pl.figure()
pl.title('Residuals vs Month');
pl.plot(data['Month'], data['resid'], 'o')
pl.xlabel('Month (1)');
pl.ylabel('Residual ($^o$C)');

pl.figure()
pl.title('Residuals vs Station');
residStation = data.pivot(index='Time', columns='Station', values='resid')
# The stations start at different times, so the pivoted frame contains NaNs;
# drop them per station before handing the data to boxplot:
pl.boxplot([residStation[c].dropna() for c in residStation.columns])
pl.xticks(range(1, len(residStation.columns)+1), residStation.columns)
pl.xlabel('Station');
pl.ylabel('Residual ($^o$C)');

pl.figure()
pl.title('Residuals vs Time');
pl.plot_date(data['Time'].astype(datetime), data['resid'], 'o')
pl.xlabel('Time (years)');
pl.ylabel('Residual ($^o$C)');

pl.figure()
pl.title('Residuals vs Fitted Values');
pl.plot(data['fit'], data['resid'], 'o')
pl.xlabel('Fitted Values ($^{o}$ C)');
pl.ylabel('Residual ($^o$C)');
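
The visual impression of an upward drift over time can be quantified with a quick check: regressing the residuals on the elapsed time dt. This is just a rough sketch (the proper joint fit follows in the next section); it reuses the resid and dt columns created above.

# Rough check: is there a linear trend left in the residuals?
residTrend = smf.ols('resid ~ dt', data=data).fit()
print('Residual trend: {0:.2f} K/century'.format(residTrend.params['dt'] * 1e2))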


Climate Change

Now we re-fit the data with a model that additionally includes a linear temperature trend. We find that the additional fit parameter improves the fit and is significantly different from zero:
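
In formula notation, the extended model T ~ C(Month):Station + dt - 1 adds a single slope $\beta$ to the seasonal station means fitted above:

$$ T_{s,m,t} = \mu_{s,m} + \beta \, \mathrm{dt} + \varepsilon_{s,m,t}, $$

where dt is the time in years since the start of the measurements, so $\beta$ is an average warming rate (in $^{o}$C per year) common to all stations and months.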


In [9]:
modelClimateChange = smf.ols('T ~ C(Month):Station + dt - 1', data=data)
fitCC = modelClimateChange.fit()
#tuple(fitCC.conf_int().loc['dt'])
'Temperature Drift: ({0:.2f} +/- {1:.2f}) K/century (1 std. err.)'.format(
                            fitCC.params['dt']*1e2, fitCC.bse['dt']*1e2)


Out[9]:
'Temperature Drift: (1.35 +/- 0.03) K/century (1 std. err.)'
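
To report a proper 95% confidence interval instead of the standard error, one could use the interval returned by conf_int(), as hinted at by the commented-out line above. A minimal sketch:

# 95% confidence interval of the warming trend, in K/century:
ciLow, ciHigh = fitCC.conf_int().loc['dt']
'Temperature Drift: {0:.2f} to {1:.2f} K/century (95% conf. interval)'.format(
                            ciLow * 1e2, ciHigh * 1e2)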